Machine Learning in Finance

Module 3

Matthew G. Son

University of South Florida

Review

Unsupervised Learning

Unsupervised learning: clustering, association, dimensionality reduction

In Finance: Clustering & Dimensionality Reduction

  • K-means clustering

  • Hierarchical clustering

  • Principal Component Analysis (PCA)

Clustering in Finance

Banking

Suppose you are a bank and have hundreds of thousands of customers and more than 100 features describing each

  • Unsupervised learning algorithms can be used to divide your customers into clusters

  • To anticipate their needs

  • Communicate more effectively

Dimensionality Reduction in Finance

Banking

Suppose you are a bank and have hundreds of thousands of customers and more than 100 features describing each

  • Or, you can reduce the features needed from 100 to 15 features

  • Removes redundancy and improves efficiency for further analysis

Machine Learning concepts

Feature scaling

Feature scaling is a technique to standardize the range of independent variables (X, or features).

Common methods are:

  1. Z-score scaling (standardization)
  2. Min-max scaling

Why feature scaling?

Not all algorithms require feature scaling, but some are especially susceptible to scale effects.

Issues caused by improper scaling:

  1. Poor performance
  2. Longer training times
  3. Misleading feature importance

Feature scaling

  1. Z-score scaling:

\[ Value \rightarrow \frac{Value - Mean}{SD}\]

  2. Min-Max scaling:

\[ Value \rightarrow \frac{Value - Minimum}{Maximum - Minimum}\]

Exercise

You are given the following vector of daily returns (in decimal form) for a stock:

daily_returns <- c(0.02, -0.01, 0.03, -0.02)

Q1. Scale using Z-score scaling.

\[ Value \rightarrow \frac{Value - Mean}{SD}\]

Q2. Scale using Min-Max scaling.

\[ Value \rightarrow \frac{Value - Minimum}{Maximum - Minimum}\]
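A quick check in R (note that `sd()` uses the sample standard deviation, which is what the z-score formula above assumes):

```r
daily_returns <- c(0.02, -0.01, 0.03, -0.02)

# Z-score scaling: subtract the mean, divide by the sample SD
z_scaled <- (daily_returns - mean(daily_returns)) / sd(daily_returns)
round(z_scaled, 3)
# [1]  0.63 -0.63  1.05 -1.05

# Min-max scaling: smallest value maps to 0, largest to 1
mm_scaled <- (daily_returns - min(daily_returns)) /
  (max(daily_returns) - min(daily_returns))
mm_scaled
# [1] 0.8 0.2 1.0 0.0
```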

Euclidean Distance

Calculates the straight-line distance between two points

  • In two dimension space, \[ d(A,B) = \sqrt{(x_B -x_A)^2 + (y_B - y_A)^2} \]

Pen and Paper Exercise

Euclidean Distance

Consider two points in a two-dimensional space:

A(2, 3) and B(7, 11)

Calculate the Euclidean distance d(A,B) between these two points using the formula:

\[ d(A,B) = \sqrt{(x_B - x_A)^2 + (y_B - y_A)^2} \]

Verify

A <- c(2, 3)
B <- c(7, 11)
d <- function(a, b) {
  return(sqrt(sum((b - a)^2)))
}
d(A, B)
[1] 9.433981

Centroid

The centroid of a cluster is the average of all points in that cluster

  • In 2D, it is the mean of all \(x\) and \(y\) coordinates: \[ \textrm{Centroid} = \left( \frac{1}{n} \sum_{i=1}^n x_i,\ \frac{1}{n} \sum_{i=1}^n y_i \right) \]

  • Acts as the center or representative point of the cluster

Pen and Paper Exercise

Centroid

Consider two points in a two-dimensional space:

A(2, 3) and B(7, 11)

Calculate the centroid.
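Verify in R (the centroid of two points is just their coordinate-wise mean):

```r
A <- c(2, 3)
B <- c(7, 11)

# Centroid = coordinate-wise mean of the points
centroid <- (A + B) / 2
centroid
# [1] 4.5 7.0
```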

Within cluster Sum of Squares (WSS)

The within-cluster sum of squares (WSS) for cluster \(j\) is:

\[ WSS_j = \sum\limits_{i=1}^nd_i^2 \]

  • \(d_i^2\) is the squared distance from point \(i\) to the centroid
  • Measures how “sparse” the cluster is
  • The higher the WSS, the sparser the cluster

Pen and Paper Exercise

Within Cluster Sum of Squares (WSS)

A cluster has three data points:

P_1(1, 2), P_2(2, 4), and P_3(3, 6).

The centroid of this cluster is given as C(2, 4).

Calculate the WSS for this cluster, defined as:
\[ WSS = \sum_{i=1}^{n} d(P_i, C)^2 \]

Verify

P1 <- c(1, 2)
P2 <- c(2, 4)
P3 <- c(3, 6)

C <- c(2, 4)

wss <- sum(d(P1, C)^2, d(P2, C)^2, d(P3, C)^2)
wss
[1] 10

Total WSS / Inertia

Inertia is the total within-cluster sum of squares (WSS) across all clusters:

\[ \mathrm{Inertia} = \sum\limits_{j=1}^K\mathrm{WSS}_j \]

  • Used to evaluate the quality of clusters formed by K-means

  • Serves as an error metric for K-means (we try to minimize it)

Pen and Paper Exercise

Inertia

Suppose you have three clusters with the following within-cluster sums of squares (WSS):

  • Cluster 1: WSS_1 = 100
  • Cluster 2: WSS_2 = 150
  • Cluster 3: WSS_3 = 200

How do you calculate the inertia for three clusters?
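Verify in R: inertia is simply the sum of the per-cluster WSS values.

```r
wss <- c(100, 150, 200)

# Inertia = sum of WSS over all K clusters
inertia <- sum(wss)
inertia
# [1] 450
```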

K-means

K-means

A popular clustering algorithm in finance.

A partitioning method that clusters a dataset into K distinct subsets.

Algorithm:

  • Find k cluster centroids that minimize distances within cluster data points
  • Initiate random starting centroids and find the optimal points

Visual Summary

Algorithm Summary

  1. Initialization:
  • Choose the number of clusters, K.

  • Randomly select K data points (without replacement) to serve as the initial centroids.

  2. Assignment:
  • Assign each data point to the nearest centroid. The result is K clusters of data points.
  3. Update:
  • For each of the K clusters, compute the new centroid (mean) of all the data points assigned to that cluster.
  4. Convergence Check:
  • If the centroids do not change (or change very little), stop. Otherwise, return to the “Assignment” step.

Algorithm Summary (Detailed)

Input:

  • \(X = \{x_1, x_2, …, x_n\}\): a set of data points (\(n\) total observations)

  • \(K\) : number of clusters

Output:

  • A set of clusters \(C = \{C_1, C_2, …, C_K\}\)

Algorithm:

  1. Initialize

    • Randomly select \(K\) points from \(X\) as centroids: \(\mu_1, \mu_2, …, \mu_K\)
  2. Repeat until convergence

    • For each \(x_i \in X\): find the nearest centroid, \(j^* = \operatorname{argmin}_{j\in\{1,…,K\}} \ d(x_i, \mu_j)\)

    • Assign \(x_i\) to cluster \(C_{j^*}\)

    • For each cluster \(C_j\) : update centroid, \(\mu_j = \frac{1}{|C_j|}\sum\limits_{x_i\in C_j} x_i\)

  3. Termination

    • Terminate if \(\Delta\mu_j = \sqrt{\sum(\mu_{j,i}^t - \mu_{j,i}^{t-1})^2} < \epsilon\)
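The steps above can be sketched directly in base R (with \(X\) a numeric matrix). This is an illustrative implementation for teaching — random initialization, Euclidean distance, no handling of empty clusters — not the h2o implementation used later:

```r
kmeans_sketch <- function(X, K, max_iter = 100, eps = 1e-8) {
  # Initialize: randomly pick K rows of X (without replacement) as centroids
  centroids <- X[sample(nrow(X), K), , drop = FALSE]

  for (iter in seq_len(max_iter)) {
    # Assignment: each point goes to its nearest centroid
    cl <- apply(X, 1, function(x) {
      which.min(colSums((t(centroids) - x)^2))
    })

    # Update: new centroid = mean of the points assigned to each cluster
    new_centroids <- t(sapply(seq_len(K), function(j) {
      colMeans(X[cl == j, , drop = FALSE])
    }))

    # Convergence check: terminate when centroids barely move
    if (sqrt(sum((new_centroids - centroids)^2)) < eps) break
    centroids <- new_centroids
  }

  list(cluster = cl, centers = centroids)
}
```

On well-separated data this typically recovers the same partitions as `stats::kmeans()`; the built-in version adds smarter initialization, handling of empty clusters, and multiple restarts.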

How many Ks?

Choosing the appropriate number of clusters (K) is crucial:

  • It significantly influences the outcomes of your clustering

Then, how can we choose K?

How to choose Ks?

  1. Heuristics: if strong prior exists (i.e. theory, regulations)

  2. With statistical methods:

  • Elbow method
  • Silhouette Method
  • Gap statistic

Elbow method

Calculate Inertia (i.e. total within-cluster sum of squares, WSS):

  • For each value of K, from small to large
  • Pick the point where the reduction in inertia slows down (the “elbow”)

Silhouette Method

The silhouette score measures how well each point \(i\) fits its own cluster compared with the neighboring clusters.

\[ S_i = \frac{b_i - a_i}{\max{[a_i, b_i]}} \]

  • \(a_i\): average distance from \(i\) to the other points in its own cluster
  • \(b_i\): average distance from \(i\) to the points in each of the other \(K-1\) clusters, taking the minimum
  • Calculate \(S_i\) for all points and average them for each K
  • Pick the K with the highest average score

Gap statistic

Compares inertia for different values of K with random clustering.

Idea: the compactness of clustering should be better than that of random clustering

\[ Gap(k) = m_k - w_k \]

  • \(m_k\): mean of the log inertia from randomly generated reference data (bootstrapped B times)
  • \(w_k\): log inertia of the K-means clustering on the actual data

Choose \(k\) where

\[ Gap(k) \geq Gap(k+1) - s_{k+1} \]

  • \(s_k\): standard deviation of the log inertia across the bootstrap reference sets
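The gap statistic is implemented in the cluster package as clusGap(), and maxSE() applies exactly the selection rule above (method "Tibs2001SEmax"). A minimal sketch on simulated two-cluster data — the data and parameter values here are illustrative assumptions:

```r
library(cluster)

# Two well-separated Gaussian blobs in 2D (illustrative data)
set.seed(123)
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))

# clusGap draws B reference datasets uniformly over the data's range;
# extra args (nstart) are passed through to kmeans
gap <- clusGap(X, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 10)

# Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
```

With two well-separated blobs like these, the rule typically selects k = 2.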

Caveats of K-means

Curse of Dimensionality

The concept of “distance” becomes less meaningful in high dimensions

  • the variation in distances shrinks as the number of dimensions increases
  • making all points appear almost equidistant

Distance distribution

1,000 randomized samples with varying dimensions from 1 to 500
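A simulation in this spirit is easy to run: draw random points in increasing dimensions and compare the spread of pairwise distances to their mean. The sample sizes below are illustrative; as the dimension grows, the relative spread collapses and points look nearly equidistant.

```r
set.seed(123)

# Relative spread (sd / mean) of pairwise distances among n random points
relative_spread <- function(d, n = 100) {
  X <- matrix(runif(n * d), nrow = n)  # n points in [0, 1]^d
  dists <- as.vector(dist(X))          # all pairwise Euclidean distances
  sd(dists) / mean(dists)
}

# Spread shrinks steadily as the dimension grows
sapply(c(1, 2, 10, 100, 500), relative_spread)
```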

Prone to Seed

K-means is sensitive to initialization (seed selection)

How to solve seed problem?

  • Select good seeds using a heuristic (theory)
  • Fix a random seed for reproducibility
  • Try out multiple starting points: (cross) validation
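Base R's stats::kmeans() builds the multiple-starting-points idea in through its nstart argument: the algorithm is run nstart times from different random initializations and the run with the lowest inertia (tot.withinss) is kept. A small sketch (the mtcars columns are chosen only for illustration):

```r
set.seed(123)
X <- scale(mtcars[, c("mpg", "wt", "hp")])

# One random start vs. the best of 25 random starts
single <- kmeans(X, centers = 4, nstart = 1)
multi  <- kmeans(X, centers = 4, nstart = 25)

single$tot.withinss  # inertia of a single (possibly unlucky) initialization
multi$tot.withinss   # best inertia over 25 initializations
```

With h2o, the analogous approach is to loop over several seed values and keep the model with the lowest h2o.tot_withinss().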

Validation on K-means

Though K-means is unsupervised, inertia can be used as an error metric, much as in supervised learning. (h2o provides this feature.)

  • Train a K-means model and use it to cluster the validation set
  • Check the variability of inertia with (cross-)validation
  • If inertia is stable, the result is not prone to the seed

H2O

H2O ML

Open-source, in-memory platform for distributed, scalable machine learning

  • Integrates with big data infrastructure (e.g., Hadoop and Spark)

  • Provides a broad set of ML algorithms

  • AutoML: automatic model training

  • High-Performance: Written in Java, robust and fast for large datasets

  • Multi-Language Support: Accessible via R, Python, Java, and Scala

  • Web API & GUI: Offers an interactive web interface for model monitoring and management

Note

H2O’s R API syntax mostly follows base R.

Most dplyr verbs do not work directly on H2OFrames.

Prep

  1. Java needs to be installed (Java or OpenJDK 11 or later recommended)
  • You can find the installation guide on Canvas.
  2. Install packages:
install.packages("h2o")
install.packages("cluster") # for silhouette
install.packages("GGally")

K-means Lab Walkthrough

K-means with H2O

In the lab walkthrough, we will use h2o to perform K-Means clustering. The workflow includes:

  1. Initializing H2O

  2. Importing or converting data into H2O frames

  3. Training ML models using h2o algorithms

  4. Evaluating and interpreting results

  5. Exporting predictions or models

Import data

library(tidyverse)
library(cluster) # needed for silhouette

country_risk <- read_csv("ml_data/Country Risk 2019 Data.csv")
# Column `GDP Growth` has whitespace
country_risk |>
  head(4)
# A tibble: 4 × 6
  Country   Abbrev Corruption Peace Legal `GDP Growth`
  <chr>     <chr>       <dbl> <dbl> <dbl>        <dbl>
1 Albania   AL             35  1.82  4.55         2.98
2 Algeria   DZ             35  2.22  4.43         2.55
3 Argentina AR             45  1.99  5.09        -3.06
4 Armenia   AM             42  2.29  4.81         6   

Clean data

country_risk <- country_risk |>
  janitor::clean_names() # clean names

country_risk |>
  head(4)
# A tibble: 4 × 6
  country   abbrev corruption peace legal gdp_growth
  <chr>     <chr>       <dbl> <dbl> <dbl>      <dbl>
1 Albania   AL             35  1.82  4.55       2.98
2 Algeria   DZ             35  2.22  4.43       2.55
3 Argentina AR             45  1.99  5.09      -3.06
4 Armenia   AM             42  2.29  4.81       6   

Variable Definitions

  1. Real GDP growth (from IMF)
    • Higher the better
  2. Corruption Index (Transparency International)
    • Higher the better (no corruption)
  3. Peace Index (Institute for Economics and Peace)
    • Lower the better (very peaceful)
  4. Legal risk index (Property Rights Association)
    • Higher the better (favorable)

Country Risk data

Evaluating country-level risk is important in international finance:

  • For portfolio diversification
  • Currency evaluation
  • Bond risk evaluation

Challenge is, there are so many nations!

  • Let’s cluster them into groups with similar characteristics

Variable Correlations

Pairwise plots to explore variable relationships:

  • Corruption and Legal have strong correlations
  • GDP Growth is distinct from others
library(GGally)
ggpairs(
  country_risk |>
    select(-country, -abbrev)
)

How many nations?

country_risk |>
  select(country) |>
  n_distinct()
[1] 121

Initiate h2o

Initiate an h2o Java server locally with h2o.init().

Tip

If an external server is already running, you can connect to it instead with h2o.init(). Check the documentation.

library(h2o)
h2o.init() # initiate h2o server locally

Upload to H2O

Since the H2O server is running separately, you need to upload the data to the server.

If you have a data.frame in R, send the data to H2O with as.h2o()

country_risk_h2o <- as.h2o(
  country_risk
)


A dry run with K = 3

Let’s do a dry run of the K-means algorithm with K = 3. We’ll then expand the analysis with the elbow method to determine the optimal K.

initial_kmeans_model <-
  h2o.kmeans(
    training_frame = country_risk_h2o,
    x = c("corruption", "peace", "legal", "gdp_growth"),
    k = 3,
    standardize = TRUE,
    seed = 123 # random initial point seed
  )
h2o.tot_withinss(initial_kmeans_model) # get inertia

Clustering result

Suppose we use K = 3 to cluster our data. We obtain cluster assignments by making predictions with the K-means model.

country_risk_h2o[["cluster"]] <- h2o.predict(
  initial_kmeans_model, # Model
  country_risk_h2o # Data to make prediction
)
print(country_risk_h2o)

Elbow method

The elbow method helps us choose a good value for K (the number of clusters) by:

  1. Trying several values of K (e.g., K = 1 to 20).

  2. Measuring how well the clusters fit the data (using the Total Within-Cluster Sum of Squares, or WSS).

  3. Plotting WSS vs K.

  4. Looking for an “elbow” — the point where adding more clusters stops giving big improvements.

Define function

get_inertia <- function(k, data = country_risk_h2o) {
  km_model <- h2o.kmeans(
    training_frame = data,
    x = c("corruption", "peace", "legal", "gdp_growth"),
    k = k, # specified from function
    standardize = TRUE,
    seed = 123
  )
  return(h2o.tot_withinss(km_model))
}

Summary table with map

# Loop over k values
k_values <- 1:20
elbow_results_h2o <- tibble(
  K = k_values,
  Inertia = map_dbl(k_values, get_inertia)
)

Visualize Elbow

elbow_results_h2o |>
  ggplot(aes(x = K, y = Inertia)) +
  geom_point() +
  geom_line() +
  theme_bw()

Clustering result

Let’s use K = 6 for our final model.

k6_model <- h2o.kmeans(
  training_frame = country_risk_h2o,
  x = c("corruption", "peace", "legal", "gdp_growth"),
  k = 6,
  standardize = TRUE,
  seed = 123
)

# Get cluster assignments
groups <- as.vector(h2o.predict(k6_model, country_risk_h2o))
# Assign cluster results on R dataframe
country_risk <- country_risk |>
  mutate(cluster = groups)
head(country_risk)
# A tibble: 6 × 7
  country   abbrev corruption peace legal gdp_growth cluster
  <chr>     <chr>       <dbl> <dbl> <dbl>      <dbl>   <int>
1 Albania   AL             35  1.82  4.55       2.98       0
2 Algeria   DZ             35  2.22  4.43       2.55       4
3 Argentina AR             45  1.99  5.09      -3.06       4
4 Armenia   AM             42  2.29  4.81       6          0
5 Australia AU             77  1.42  8.36       1.71       3
6 Austria   AT             77  1.29  8.09       1.60       3

Analysis on Cluster Groups

Let’s skim the first 2 observations of each group.

country_risk |>
  group_by(cluster) |>
  slice(1:2)
# A tibble: 12 × 7
# Groups:   cluster [6]
   country   abbrev corruption peace legal gdp_growth cluster
   <chr>     <chr>       <dbl> <dbl> <dbl>      <dbl>   <int>
 1 Albania   AL             35  1.82  4.55      2.98        0
 2 Armenia   AM             42  2.29  4.81      6           0
 3 Iran      IR             26  2.54  4.58     -9.46        1
 4 Nicaragua NI             22  2.31  4.34     -5.04        1
 5 Burundi   BI             19  2.52  3.80      0.419       2
 6 Cameroon  CM             25  2.54  4.31      4.00        2
 7 Australia AU             77  1.42  8.36      1.71        3
 8 Austria   AT             77  1.29  8.09      1.60        3
 9 Algeria   DZ             35  2.22  4.43      2.55        4
10 Argentina AR             45  1.99  5.09     -3.06        4
11 Botswana  BW             61  1.68  5.96      3.48        5
12 Chile     CL             67  1.63  6.88      2.52        5

Visualize summary statistics on each cluster group. How do you interpret the grouping results?

country_risk |>
  group_by(cluster) |>
  summarize(across(!c(country, abbrev), mean))
# A tibble: 6 × 5
  cluster corruption peace legal gdp_growth
    <int>      <dbl> <dbl> <dbl>      <dbl>
1       0       37.7  2.01  5.00       5.32
2       1       24    2.44  4.22      -7.19
3       2       26.6  2.84  4.28       2.50
4       3       80.5  1.42  8.18       1.48
5       4       37.2  2.12  5.18       1.54
6       5       59.6  1.75  6.65       2.43
country_risk |>
  group_by(cluster) |>
  summarize(across(!c(country, abbrev), mean)) |>
  pivot_longer(-cluster) |>
  ggplot(aes(x = as.factor(cluster), y = value, fill = name)) +
  geom_col(position = "dodge", width = 0.5) +
  labs(
    x = "Group",
    y = "Index Value",
    fill = "Category",
    title = "Cluster Profiles (K=6)"
  ) +
  theme_bw()

Appendix: Silhouette Method

Silhouette score can be computed using cluster::silhouette(). It requires:

  • Cluster assignments (we use H2O)
  • Distance matrix
library(cluster)

get_silhouette <- function(k) {
  # Step 1: Fit k-means using H2O
  km_model <- h2o.kmeans(
    training_frame = country_risk_h2o,
    x = c("corruption", "peace", "legal", "gdp_growth"),
    k = k,
    standardize = TRUE,
    seed = 123
  )

  # Step 2: Get cluster assignments as a vector
  clusters <- as.vector(h2o.predict(km_model, country_risk_h2o))

  # Step 3: Compute distance matrix on the four features
  # (select them explicitly: country_risk now also carries the integer
  # `cluster` column from earlier, which must not enter the distances)
  dist_matrix <- country_risk |>
    select(corruption, peace, legal, gdp_growth) |>
    dist()

  # Step 4: Compute silhouette scores
  sil <- silhouette(clusters, dist_matrix)

  # Step 5: Return average silhouette score for k
  return(mean(sil[, 3]))
}

# Create summary table
silhouette_summary <- tibble(
  K = 2:20,
  Avg_sil = map_dbl(2:20, get_silhouette)
)

Visualize Silhouette score for each K

silhouette_summary |>
  ggplot(aes(x = K, y = Avg_sil)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(n.breaks = 10) +
  theme_bw()

Homework Exercise

K-means on mtcars

mtcars data is available in R. Load it with data(mtcars).

Use all numeric variables for clustering.

  1. Perform the elbow (and silhouette) method to determine the best k.

  2. With your choice of k, describe the clusters in detail including visualizations.

Report in Quarto .html.

Homework Reading

Homework Reading

John C. Hull “Machine Learning in Business”

  • Chapter 2